
ENH: better dtype inference when doing DataFrame reductions #52788

Merged (79 commits) on Jul 13, 2023

Conversation

topper-123
Contributor

@topper-123 topper-123 commented Apr 19, 2023

Supersedes #52707 and possibly #52261, depending on reception.

This PR improves on #52707 by not attempting to infer the combined dtypes from the original dtypes, and builds heavily on the ideas in #52261, but avoids the overloading of kwargs in the ExtensionArray functions/methods in that PR. Instead, it keeps all the work in ExtensionArray._reduce by adding an explicit keepdims param to that method signature. This makes this implementation simpler than #52261 IMO.
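To make the idea concrete, here is a minimal sketch of the keepdims mechanism (MyMaskedArray is a made-up illustration, not the pandas implementation): with keepdims=True the reduction returns a length-1 array instead of a bare scalar, which is what lets a DataFrame-level reduction keep the extension dtype.

import numpy as np

# Hypothetical minimal masked array, only to illustrate the keepdims idea.
class MyMaskedArray:
    def __init__(self, data, mask):
        self._data = np.asarray(data)
        self._mask = np.asarray(mask, dtype=bool)

    def _reduce(self, name: str, *, skipna: bool = True, keepdims: bool = False, **kwargs):
        values = self._data[~self._mask] if skipna else self._data
        result = getattr(np, name)(values)  # e.g. name == "sum" -> np.sum
        if keepdims:
            # Wrap the scalar in a length-1 array of the same array type so a
            # DataFrame-level reduction can concatenate per-column results
            # without losing the dtype.
            return MyMaskedArray([result], [False])
        return result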

This PR is not finished, and I would appreciate feedback on the direction here compared to #52261 before I proceed. This currently works correctly for masked arrays & ArrowArrays (AFAIKS), but the other ExtensionArray subclasses (datetime-like arrays & Categorical) still need to be handled. I've written some tests, but need to write more + write the docs.

CC: @jbrockmendel & @rhshadrach

@jbrockmendel
Member

The reason for the gymnastics/checking for keepdims in #52261 is that third-party EAs won't have implemented keepdims, so they will need a deprecation cycle to catch up (though in general I'd be OK with going less far out of our way with this kind of thing).
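(For context on that concern: one common way to phase in a new keyword without breaking third-party implementations is signature introspection, essentially the pattern this PR adopts later in the thread. call_reduce_compat below is a made-up helper, not pandas code.)

import warnings
from inspect import signature

import numpy as np

def call_reduce_compat(arr, name: str, **kwds):
    # Only pass the new keyword if this _reduce implementation accepts it;
    # otherwise warn and wrap the scalar result ourselves.
    if "keepdims" in signature(arr._reduce).parameters:
        return arr._reduce(name, keepdims=True, **kwds)
    warnings.warn(
        f"{type(arr)}._reduce will require a `keepdims` parameter in the future",
        FutureWarning,
    )
    return np.array([arr._reduce(name, **kwds)])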

@@ -1310,6 +1312,12 @@ def pyarrow_meth(data, skip_nulls, **kwargs):
if name == "median":
    # GH 52679: Use quantile instead of approximate_median; returns array
    result = result[0]

if keepdims:
    # TODO: is there a way to do this without .as_py()
Member

cc @jorisvandenbossche suggestions? (also i expect you'll have thoughts on the big-picture approach in this PR)

Member

@jorisvandenbossche gentle ping, i don't want to move forward here without your OK

Contributor Author

This just converts a scalar, so performance-wise the effect from this is limited. It would be worse if it were a whole array being converted.

Member

Yes, using as_py() is the sensible thing to do here to convert a scalar to a len-1 array (until pyarrow supports its own scalars in pa.array([..])).

(we have internal functionality in C++ to convert a scalar to a len-1 array, but not sure that is worth exposing in pyarrow)

@topper-123 topper-123 force-pushed the reduction_dtypes_II branch from 53e6241 to 6c25d56 Compare April 19, 2023 21:45
@topper-123 topper-123 marked this pull request as ready for review April 20, 2023 05:53
@topper-123
Contributor Author

I've refactored to use a _reduce_with_wrap method instead of a keyword. This avoids the deprecation issue raised by @jbrockmendel.

@jbrockmendel
Member

I've refactored to use a _reduce_with_wrap method instead of a keyword. This avoids the deprecation issue

This still requires downstream authors to implement something new, just avoids issuing a warning to inform them about it.

@topper-123
Contributor Author

topper-123 commented Apr 20, 2023

Actually it is backward compatible: if they don't implement this, their scalar just gets wrapped in a numpy array, so it will work unchanged. They just don't get the improved type inference without implementing _reduce_with_wrap in their arrays.

Comment on lines 1116 to 1125
if self.dtype.kind == "f":
    np_dtype = "float64"
elif name in ["mean", "median", "var", "std", "skew"]:
    np_dtype = "float64"
elif self.dtype.kind == "i":
    np_dtype = "int64"
elif self.dtype.kind == "u":
    np_dtype = "uint64"
else:
    raise TypeError(self.dtype)
Member

Should corr, sem, and kurt be part of the first elif? And is this not hit by e.g. a sum of Boolean?

In general I'm not a fan of encoding op-specific logic like this. It can lead to bugs if behavior is changed in one place but not the other.

Contributor Author

Yeah, good points, I'll have to look into this again to see if I can do it differently.

Member

Wow - nice! So this wasn't necessary?

Contributor Author

Yeah, it was needed in the first draft, then I changed the code, making it unnecessary.

There are still some things missing in this PR. I'll update today.

Comment on lines 200 to 201
elif method in ["prod", "sum"]:
    kwargs["min_count"] = 1
Member

Why does this need to change?

Contributor Author

@topper-123 topper-123 Apr 21, 2023

The input array (arr2d) contains NA values. Previously we were adding scalars, so e.g. sum([pd.NA]) would give pd.NA, even with min_count=0.

In the new version, pandas correctly gives a result of 0 if min_count=0, while correctly giving pd.NA with min_count=1.

So this could be considered a bug fix.
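For reference, the min_count behaviour being discussed is visible through the public API (illustration only):

import pandas as pd

s = pd.Series([pd.NA], dtype="Int64")
print(s.sum(min_count=0))  # 0    -- no minimum number of valid values required
print(s.sum(min_count=1))  # <NA> -- fewer than 1 valid value present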

Member

Thanks; is there a test for min_count=0?

Contributor Author

There isn't currently one for the 2-dim case, only the 1-dim case. I can add it, but that will have to be tomorrow.

Contributor Author

I've updated this.

@topper-123
Contributor Author

topper-123 commented Apr 24, 2023

I've updated, so it now works correctly with pyarrow and categorical dtypes. Datetimelikes are still missing, likewise docs and maybe some additional tests. So now we can do (notice the dtype):

>>> arr = pd.array([1, 2], dtype="int64[pyarrow]")
>>> df = pd.DataFrame({"a": arr, "b": arr})
>>> df.median()
a    1.5
b    1.5
dtype: double[pyarrow]

This PR has been rebased on top of #52890. It could be an idea to get that merged first.

@topper-123 topper-123 force-pushed the reduction_dtypes_II branch from c1a1a39 to 5297cf3 Compare April 26, 2023 05:45
@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions Reduction Operations sum, mean, min, max, etc. labels Apr 26, 2023
@topper-123 topper-123 force-pushed the reduction_dtypes_II branch from 6f76e2c to 346f043 Compare May 1, 2023 15:50
Comment on lines 112 to 122
def _simple_new(cls, values: np.ndarray, mask: npt.NDArray[np.bool_]) -> Self:
result = BaseMaskedArray.__new__(cls)
result._data = values
result._mask = mask
return result
Member

This is just to save from doing the isinstance check when calling __init__ with copy=False?

Contributor Author

@topper-123 topper-123 May 2, 2023

This is from #53013 (I built this on top of that to see what performance improvements I get here), so it can be discussed independently in #53013.

And yes, this is to avoid the checks when they are not needed. For this PR I need BaseMaskedArray.reshape to be fast, and this helps with that (and many other cases).

@@ -507,6 +507,38 @@ def test_reduce_series(self, data, all_numeric_reductions, skipna, request):
    request.node.add_marker(xfail_mark)
    super().test_reduce_series(data, all_numeric_reductions, skipna)

def check_reduce_and_wrap(self, ser, op_name, skipna):
    if op_name in ["count", "kurt", "sem", "skew"]:
        pytest.skip(f"{op_name} not an array method")
Member

Can you add an assert with hasattr to ensure the skip reason is valid?

Contributor Author

Added.

@@ -367,6 +372,30 @@ def check_reduce(self, s, op_name, skipna):
    expected = bool(expected)
    tm.assert_almost_equal(result, expected)

def check_reduce_and_wrap(self, ser: pd.Series, op_name: str, skipna: bool):
    if op_name in ["count", "kurt", "sem"]:
        pytest.skip(f"{op_name} not an array method")
Member

Same

Contributor Author

Added.

    cmp_dtype = "boolean"
elif op_name in ["sum", "prod"]:
    is_windows_or_32bit = is_platform_windows() or not IS64
    cmp_dtype = "Int32" if skipna and is_windows_or_32bit else "Int64"
Member

I wouldn't have expected skipna to have an influence here; is this deliberate or an existing issue?

Contributor Author

@topper-123 topper-123 May 2, 2023

I've had a big struggle getting a handle on how the dtype ends up changing on different systems, so this is definitely not deliberate on my part; I'm just trying to get the tests to pass, after which it will be easier to fix up whatever needs to be fixed.

But I think you are probably right, I can look into it tomorrow. Having said that, I don't think I've changed anything in this regard, so if this is a bug, this PR just uncovers an existing bug and it shouldn't be something that was introduced here.

Contributor Author

I've fixed this, so skipna doesn't affect the dtype.

Also: "I don't think I've changed anything in this regard" are always famous last words. My PR was the cause of this, so good catch.

@topper-123
Contributor Author

After the various perf PRs (#52998, #53013 & #53040), I now get unchanged perf for axis=0 when running asv continuous -f 1.1 upstream/main HEAD -b stat_ops.FrameOps:

       before           after         ratio
     [4f14b456]       [5760ee38]
     <master>         <reduction_dtypes_II>
+      3.90±0.02s       12.7±0.01s     3.25  stat_ops.FrameOps.time_op('median', 'Int64', 1)
+      4.22±0.01s       6.68±0.01s     1.58  stat_ops.FrameOps.time_op('kurt', 'Int64', 1)
+         3.96±0s       5.86±0.01s     1.48  stat_ops.FrameOps.time_op('skew', 'Int64', 1)
+      2.94±0.01s       3.82±0.01s     1.30  stat_ops.FrameOps.time_op('mean', 'Int64', 1)
+         2.34±0s       3.03±0.01s     1.29  stat_ops.FrameOps.time_op('sum', 'Int64', 1)
+      2.35±0.01s       3.01±0.01s     1.28  stat_ops.FrameOps.time_op('prod', 'Int64', 1)
+      3.70±0.01s       4.58±0.01s     1.24  stat_ops.FrameOps.time_op('std', 'Int64', 1)
+      3.63±0.01s       4.50±0.01s     1.24  stat_ops.FrameOps.time_op('var', 'Int64', 1)
-     3.46±0.03ms      3.08±0.02ms     0.89  stat_ops.FrameOps.time_op('median', 'Int64', 0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

That means that axis=0 is good now, while axis=1 is slower because it's in effect the same as doing df.T.sum(axis=0) on a (4, 100_000)-shaped DataFrame, where the transpose is very slow (but can be improved by #52836 or #52083), plus there is a lot of work to extract masked data from 100_000 ExtensionBlocks. So I think this is as good as it gets for axis=1, unless we can avoid transposing the dataframe, which would speed things up tremendously, but that's unrelated to this PR.
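For intuition, the axis=1 path is conceptually the transpose-then-reduce pattern described above (illustration only, on a tiny frame rather than the benchmark's wide one):

import pandas as pd

arr = pd.array([1, 2], dtype="Int64")
df = pd.DataFrame({"a": arr, "b": arr})

# Reducing along axis=1 does roughly the same work as transposing first and
# reducing along axis=0, which is why very wide frames are slow here.
print(df.sum(axis=1))
print(df.T.sum(axis=0))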

@topper-123
Contributor Author

I've improved the median-with-axis=1 case, so it now falls in line with the other axis=1 reductions:

       before           after         ratio
     [4f14b456]       [122488b1]
     <master>         <reduction_dtypes_II>
+      4.25±0.01s       6.68±0.02s     1.57  stat_ops.FrameOps.time_op('kurt', 'Int64', 1)
+      4.64±0.05s          6.42±0s     1.38  stat_ops.FrameOps.time_op('sem', 'Int64', 1)
+         2.35±0s       3.04±0.01s     1.29  stat_ops.FrameOps.time_op('sum', 'Int64', 1)
+      2.96±0.01s          3.81±0s     1.29  stat_ops.FrameOps.time_op('mean', 'Int64', 1)
+      3.60±0.04s          4.53±0s     1.26  stat_ops.FrameOps.time_op('var', 'Int64', 1)
+      3.70±0.01s          4.65±0s     1.26  stat_ops.FrameOps.time_op('std', 'Int64', 1)
+      3.92±0.01s       4.85±0.02s     1.24  stat_ops.FrameOps.time_op('median', 'Int64', 1)
+     1.36±0.05ms      1.50±0.07ms     1.11  stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
-     3.45±0.07ms      3.02±0.04ms     0.88  stat_ops.FrameOps.time_op('median', 'Int64', 0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Generally, in the axis=1 cases in the tests, we now have to concatenate the reduction results for the 100_000 masked arrays. Before, we only had to concatenate 100_000 numpy arrays, which was inherently faster. OTOH, concatenation over 100_000 masked arrays is always slow, so this can't become much faster without a significant rewrite, which would be out of scope here.

Length: 1, dtype: Int64
"""
result = self._reduce(name, skipna=skipna, **kwargs)
return np.array([[result]])
Member

this looks like a 2D result. Should it be 1D?

Contributor Author

This just moved the existing behavior from pandas.core.internals.blocks.Block.reduce to here. Having said that, I think you are right; 1D makes more sense here and it passes all tests too, so I've changed it.

return result

def _wrap_na_result(self, *, name, axis):
Member

This feels like it got a lot more complicated than the first attempt at "keepdims". Does this just address more corner cases the first attempt missed?

Contributor Author

Yeah, this is not great, and I've tinkered quite a lot with other implementations. I think/hope I will have a simpler solution tonight.

Contributor Author

OK, I looked into this, and unless I do a relatively major refactoring, it looks like I would just move complexity around by changing this. So without a bigger rework of array_algos, I don't see any clearly better method.

Suggestions/ideas that show otherwise are welcome, of course.

Contributor Author

To add: the underlying issue is that the functions in masked_reductions.py only return scalar NA values. This means that we don't get the type information when doing reductions that return NA, but have to infer it, and the inferring is complex because it depends on the calling method, the dtype of the calling data, and the OS platform.

The solution would be to return proper 2D results from the masked_reductions.py functions, i.e. return a tuple instead of a scalar, e.g.

  1. return (scalar_value, None) when returning a scalar
  2. return (value_array, mask_array) when returning a masked array

and then wrap the result in an array in BaseMaskedArray._wrap_reduction_result when it should be a masked array.

However, I'm not sure if that's the direction we want to pursue, because unless we want to support 2D masked arrays, the current solution could still be simpler than building out 2D reduction support.
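A rough sketch of what such a tuple-returning reduction could look like (entirely hypothetical; masked_sum and its signature here are made up for illustration and are not the existing masked_reductions functions):

import numpy as np

def masked_sum(values: np.ndarray, mask: np.ndarray, *, skipna: bool = True, min_count: int = 0):
    # Return (value_array, mask_array) instead of a bare scalar, so the caller
    # keeps the result-dtype information even when the result is NA.
    valid = ~mask
    # The value is computed even when the result will be masked; that preserves
    # the dtype but does "unnecessary" work in the NA case (the trade-off
    # discussed below).
    value_array = np.array([values[valid].sum()])
    result_is_na = (not skipna and mask.any()) or valid.sum() < min_count
    return value_array, np.array([result_is_na])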

Member

Longer term, that indeed sounds like a good alternative, even for the current 1D-reshaped-as-2D case. One potential disadvantage is that you still need to do some actual calculation on the data, unnecessarily, to get the value_array return value (so we can use that to determine the result dtype). Calculating the actual value is "unnecessary" if you know the result will be masked. Of course, it avoids hardcoding result dtypes. But for the case of skipna=False with a large array, that might introduce a slowdown?

Contributor Author

We only "know" because of the hard-coding here, for example Series([1, 2, pd.NA], dtype="int8").sum() is Int32 on some systems and Int64 on others. So to avoid the hardcoding we will have to do a calculation, so it's a trade-off (or we could hard-code the dtype when we're 100 % sure we get a NA, but that'll be about the same complexity as now + the extra code for doing real 2d ops). I think we should only do the 2d ops if we want it independent of this issue here.

IMO that's a possible future PR, if we choose to go in that direction.

Member

We only "know" because of the hard-coding here

In the end, numpy also kind of "hard codes" this, just in the implementation itself. And we get the difficulty of combining numpy's algo with our own logic, where we also want to know this information. So I don't think it's necessarily "wrong" to hard-code this on our side as well (i.e. essentially keep it as you are doing in this PR, also long term, at least as long as the arrays are only 1D).

Sidenote: seeing the additional complexity for 32-bit systems in deciding the result dtype, I do wonder if we actually want to get rid of that system-dependent behaviour? (We also don't follow numpy's behaviour in the constructors, but always default to int64.) Although given that we still call numpy for the actual operation, it probably doesn't reduce complexity to ensure this dtype guarantee (we would need to move the 32-bit check into the implementation of the masked sum).

@jbrockmendel
Member

I'd expect this to be very close to perf-neutral. Where is the slowdown coming from?

@topper-123
Contributor Author

I'd expect this to be very close to perf-neutral. Where is the slowdown coming from?

It is performance-neutral when the dataframe is not wide, but when the dataframe gets very wide there will be a certain slowdown, because pandas has to create one new masked array per column and then join those masked arrays. Joining masked arrays is slower than joining the same number of ndarrays, because each masked array holds two ndarrays internally, and the masked arrays are written in Python while the ndarrays are written in C, which adds some overhead.

Having said that, note that the ASV slowdowns I show are about joining 100_000 columns, i.e. quite a lot. I think the performance is actually OK, taking the above explanation into account.

@jbrockmendel
Member

I haven't re-reviewed in a while, but I was happy here as of the last time I looked at it. If Joris is on board, let's do it.

@topper-123
Contributor Author

I've actually, just this morning, made a new version using _reduce only (i.e. scrapping _reduce_and_wrap). I prefer this new version (because we now only have one reduction method instead of two), but if the other is preferred, it is easy to revert back.

I maintain backward compat in this new version by:

  1. making the keepdims parameter keyword-only
  2. only calling _reduce with keepdims=True if the _reduce method signature has a parameter named "keepdims"
  3. if the _reduce method signature does not have a parameter named "keepdims", calling _reduce without the keepdims parameter, emitting a FutureWarning, and taking care of wrapping the (scalar) reduction result in an ndarray before passing it on.

This is possible because _reduce_and_wrap was actually only called inside the blk_func inside DataFrame._reduce, so by doing some introspection there we can keep backward compat. See the new version:

pandas/pandas/core/frame.py

Lines 10874 to 10895 in f85deab

def blk_func(values, axis: Axis = 1):
    if isinstance(values, ExtensionArray):
        if not is_1d_only_ea_dtype(values.dtype) and not isinstance(
            self._mgr, ArrayManager
        ):
            return values._reduce(name, axis=1, skipna=skipna, **kwds)
        sign = signature(values._reduce)
        if "keepdims" in sign.parameters:
            return values._reduce(name, skipna=skipna, keepdims=True, **kwds)
        else:
            warnings.warn(
                f"{type(values)}._reduce will require a `keepdims` parameter "
                "in the future",
                FutureWarning,
                stacklevel=find_stack_level(),
            )
            result = values._reduce(name, skipna=skipna, kwargs=kwds)
            return np.array([result])
    else:
        return op(values, axis=axis, skipna=skipna, **kwds)

Notice especially the FutureWarning starting on line 10885. This allows us not to require keepdims now, even though keepdims is in the signature of ExtensionArray._reduce. In v3.0 we will drop the signature checking and only call values._reduce with keepdims=True, i.e. it will fail without a keepdims parameter in v3.0.

Check out the test_reduction_without_keepdims test in pandas/tests/extension/decimal/test_decimal.py for a test of what happens when ExtensionArrays don't have a keepdims parameter in their _reduce method.

Thoughts? I prefer this new version, but it is easy to revert back if needed.

@topper-123
Contributor Author

Ping...

Member

@mroeschke mroeschke left a comment

Looks pretty good. Just a single comment from me: https://github.com/pandas-dev/pandas/pull/52788/files#r1261463068

@topper-123
Contributor Author

I've updated, added some more, and fixed some dtype issues with DataFrame.any/all.

@mroeschke mroeschke added this to the 2.1 milestone Jul 13, 2023
["median", Series([2, 2], index=["B", "C"], dtype="Float64")],
["var", Series([2, 2], index=["B", "C"], dtype="Float64")],
["std", Series([2**0.5, 2**0.5], index=["B", "C"], dtype="Float64")],
["skew", Series([pd.NA, pd.NA], index=["B", "C"], dtype="Float64")],
Member

In a follow-up, if you could add kurt (if supported) that would be great.

@mroeschke mroeschke merged commit dc830ea into pandas-dev:main Jul 13, 2023
@mroeschke
Member

Thanks, this is an awesome improvement @topper-123.

@jbrockmendel
Member

nice! thanks for sticking with this @topper-123

@topper-123 topper-123 deleted the reduction_dtypes_II branch July 13, 2023 17:24
@topper-123
Contributor Author

Yes, but it was worth the slow process; I wouldn't have thought of this way to add a parameter to _reduce in a backward-compatible way if this had gone faster 😄.

Successfully merging this pull request may close these issues:

BUG: Summing empty DataFrame can convert to incompatible type